Anonymization method

The invention relates to a method for anonymizing
sensitive data within a data stream.

Information for long-term storage is stored in 
databases. The value of such information collections is 
considered to be an essential asset of organizations. 
Owing to this sensitivity, access to databases is
generally restricted, i.e. access is possible only for
authorized users in accordance with their user rights
profiles. A user rights profile defines who can access
which data in which modes (for example reading,
writing). A common example is that not every employee
of a company can access personnel data. Employees can
also be restricted, on a "need-to-know" principle, to
only that information which they require to carry out
their duties. All other information is barred. An
administrator is responsible for allocating the access
rights, and the reliability of the data protection
depends essentially on this administrator.

To provide data security, anonymization methods are
frequently used which anonymize the data which is not
to be accessed. Such methods are used in particular if
data is to be transferred to a database in the form of
a data stream, in which case it is necessary to ensure
that there is no unauthorized access to the data on the
transmission path. An application example of this is
the dispatch of a data stream by e-mail. Sender and
receiver then have full access rights to all the
data contained in the database. The data is encrypted
before transmission so that attackers on the
Internet cannot access the data. The receiver decrypts
the data and can access it completely.

In the known methods for protecting databases,
authorization and checking of user rights is typically
performed at the front end of the database. This
applies, for example, to DB2™ from IBM. If a higher
level of user access rights is required, there are
commercial products, for example RACF™ (Resource Access
Control Facility) from IBM. However, access control is
also performed here by an administrator.



A classic situation in which the conventional methods
are inadequate is an outsourcer/insourcer relationship.
An outsourcer has certain services provided by an
insourcer and provides the insourcer with all the data
necessary to do so, said data being stored in a
database at the insourcer's end. If, for data
protection reasons or for reasons of customer
protection, the outsourcer itself wishes to control the
dissemination of customer-identifying data, the known
anonymization methods are used either to prevent access
to the entire database or to place the selective
control of access to specific data under the aegis of
an administrator located at the insourcer's
premises. In principle, therefore, access to sensitive
data would still be possible.

The object of the present invention is to make
available a method which permits a database to be
accessed, but excludes certain data within this
database from access without destroying the
relationship between the excluded data and the rest of
the data. It should be possible to transfer the
database to third parties for processing of the
non-protected data, without losing control of access to
the protected data.

According to the invention, a method for anonymizing
sensitive data within a data stream is proposed, having
the following steps:

a) the sensitive data field is compressed,

b) the sensitive data field is anonymized,

c) the anonymized sensitive data field is marked
within the data stream by means of start and stop
characters.

According to the invention, the sensitive data is
selectively anonymized within a database. The
anonymized data fields are provided with a start
character and a stop character in order to identify
them for later de-anonymization.

The method according to the invention can be used in
particular when a database user stores data in a
database, and some of the data items are to be
processed by a database operator. While the database
user is authorized to read all the data, sensitive
data, for example customer-identifying information, is
to be anonymized as far as the database operator is
concerned, and it is to be impossible for said database
operator to de-anonymize said information. The
anonymization information remains with the database
user. The non-anonymized data can be evaluated and
processed by the database operator. The relationship
between the data remains unchanged.



The sensitive data can be, for example,
customer-identifying information, and it is to be
possible for the data assigned to the customer to be
read for the purpose of statistical evaluation. The
database can be partially anonymized with the
anonymization method according to the invention and
passed on to third parties for statistical evaluation
and processing. The customer-identifying data cannot be
read by the third party. The control over which user
access rights are assigned to which persons remains
with the database user. The relationship between the
processed data and the respective anonymized data, such
as customer name, remains unchanged. After the
evaluated or processed database is returned to the
database user, the database user can perform a
de-anonymization and use the entire processed database.

The method according to the invention can, in
particular, be applied advantageously even if the
sensitive data fields have a predefined field length.
However, it is self-evident that the method can also be
applied without restriction when field lengths are
unlimited. Even though the following statements relate
primarily to sensitive data fields of a predefined
field length, this is not to be understood as
restrictive.



The data can advantageously be compressed before the
sensitive data field is anonymized. In the case in
which the data field is completely filled, this
provides the space for the addition of start and stop
characters for marking the anonymized data field. The
marking is necessary for later de-anonymization of the
data field.

If, on the other hand, the data field is not completely
filled, or if the compression reduces the data
to such an extent that there is still space remaining
in the data field, the data field can be filled up with
fill characters before the anonymization.

There are, in particular, two possible methods
available for anonymizing the data field, namely
pseudonymization and encryption.



If the data field is completely filled,
pseudonymization is preferably performed. To do this,
the length of the pseudonym used has to be selected in
such a way that space remains for start and stop
characters in the data field after the
pseudonymization.

If there is still space within the data field, the data
field is preferably at least partially filled with fill
characters, in particular with random values, and
subsequently encrypted.



Filling the field with random values ensures that
isonomies, i.e. identical field contents, are resolved.
For example, it is necessary that frequently occurring
names, such as Müller, Meier etc. in the
German-speaking world, are encrypted differently each
time, so that it is not possible to draw conclusions
about the data by analyzing its frequency. This is done
by filling the data field with random values and
subsequently encrypting it.
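The effect of the random fill can be sketched as follows. This is a minimal illustration only: the helper names `fill_field` and `strip_fill`, the separator character and the fill alphabet are assumptions made for the example and are not specified by the method; the subsequent encryption step is omitted.

```python
import secrets
import string

FILL_ALPHABET = string.ascii_letters + string.digits  # assumed fill characters
SEPARATOR = "\x1f"  # assumed marker between content and fill; must not occur in content


def fill_field(content: str, field_length: int) -> str:
    """Pad content up to field_length with a separator and random fill characters."""
    room = field_length - len(content)
    if room < 1:
        raise ValueError("no space left for fill characters")
    fill = "".join(secrets.choice(FILL_ALPHABET) for _ in range(room - 1))
    return content + SEPARATOR + fill


def strip_fill(filled: str) -> str:
    """Recover the original content by cutting at the separator."""
    return filled.split(SEPARATOR, 1)[0]


first = fill_field("Meier", 38)
second = fill_field("Meier", 38)
# Both values carry the same name but almost certainly differ in their fill,
# so a cipher applied afterwards sees two different plaintexts.
```

Because the fill is drawn afresh for every field, two records with the same name would be expected to yield different ciphertexts, which is the property the paragraph above asks for.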

In a preferred embodiment of the method according to
the invention, information relating to the key used for
the encryption is also stored in the encrypted data
field. This key information has the purpose of enabling
the database user to decrypt the encrypted data. In
this way, it is possible to use various keys for
encrypting the data, the corresponding key information
for identifying the key being stored in each case
within the field. Of course, the filling level of the
field must be chosen, or reduced by means of data
compression, in such a way that space remains for
storing the key information.



The detection of which data is to be encrypted or
decrypted can be implemented by clear marking with what
are referred to as start and stop characters, such as
"{" and "}". In the system in question, the start and
stop characters must not be used for any purpose other
than marking encrypted data. This approach has the
advantage that it is independent of the applications
which operate on the data.

If there is no single unambiguous start character in
the system in question, a sequence of start characters
can be used. The same applies to the stop characters.
In the simplest case, the start character sequence
could be composed of a character which is identical to
the stop character. However, this in turn has the
disadvantage that synchronization in a fault situation
is no longer possible solely on the basis of the
knowledge of start and stop characters.

The method according to the invention is explained in
more detail below by means of various examples and with
reference to the appended figures, in which:

Figure 1 shows the marking of sensitive data which is
to be anonymized;

Figure 2 shows the flowchart of an encryption and/or
decryption process;

Figure 3 shows the flow of an encryption process;

Figure 4 shows the structure of an encrypted data
field;

Figure 5 shows the flow of a decryption process.

The anonymization method should fulfill the following
requirements:

1. Frequently occurring data (for example the
   frequently occurring names Müller, Meier etc. in
   the German-speaking world) should be encrypted
   differently. This is intended to prevent
   conclusions being drawn about the data
   itself by analyzing the frequency of data. The
   intention is to resolve the isonomies in the data.

2. The length of a data field to be encrypted is
   restricted by a fixed maximum length which is
   predefined essentially by the database design.
   Field types, for example numeric or alphanumeric,
   must not be changed. This requirement permits
   subsequent integration of the method without the
   operator of a database system having to change its
   applications in order to process the data.

3. Each encrypted data field contains all the
   information, apart from keys and systemwide
   parameters, for decryption. It is therefore
   possible to process each data field independently.

The aforesaid three properties are to be fulfilled
simultaneously by the selected anonymization method.

In order to carry out the method, the filling level
(compression ratio) of the data field to be anonymized
is first checked. It must be ensured that there is
still sufficient space within the predefined fixed data
field length after the encryption in order to store a
start character, a stop character and information
on the key used.

If the filling level of the data field is too high to
permit encryption under the aforesaid
criteria, the data field is first compressed. If
compression does not reduce the field to a
sufficiently small size either, pseudonymization
is carried out. The pseudonym must be selected in such
a way that the condition predefined under 2. in terms
of the filling level of the data field is fulfilled.

If the filling level of the data field is sufficiently
small to permit encryption of the data field, the
encryption is performed. To do this, the data field is
first filled to the maximum possible filling level
with random values.
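The decision sequence described above can be sketched as follows. This is an illustration under stated assumptions: the function name `choose_method`, the returned labels, and the use of `zlib` as the compressor are choices made for the example; the method itself leaves the compression algorithm open.

```python
import zlib


def choose_method(content: bytes, capacity: int) -> str:
    """Pick a processing path for one data field.

    capacity is the number of bytes left for the payload once the start
    character, the stop character and the key information are accounted for
    (an assumed, simplified notion of the "filling level" check).
    """
    if len(content) <= capacity:
        # Fits as it is: fill with random values, then encrypt.
        return "encrypt"
    if len(zlib.compress(content)) <= capacity:
        # Fits only after compression: compress, fill, then encrypt.
        return "compress_then_encrypt"
    # Compression did not free enough space: fall back to a pseudonym.
    return "pseudonymize"
```

For example, a short name fits directly, a long repetitive value fits only after compression, and a long value with no redundancy falls through to pseudonymization.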

When the information content of the data field is
small, data compression can be performed before the
filling in order to resolve isonomies more effectively.

The encryption is then performed. The encryption
algorithm used can be selected as desired. Common
algorithms are, for example, IDEA (International Data
Encryption Algorithm) or DES (Data Encryption
Standard).

The encrypted data field is then marked with a start
character and a stop character. In addition,
information relating to the key used for the encryption
is stored in the data field at a previously defined
position.

The following example will illustrate the method:

The data field length is 40 characters. The content of
the unencrypted data field is the name "Meier". "{" is
used as the start character, and "}" is used as the
stop character. The data field is filled to the full
field length and provided with start and stop
characters, that is to say:

{Meier                                   }

The 40 characters between the start and stop characters
are processed by the method. The encryption then
results in a 40 character-long data field including the
start and stop characters, that is to say for example:

{ch7 4nHhdjqa yjas8}

In the encrypted data fields, k bits are provided for
identifying the key used from a key set. It is thus
possible to represent 2^k different keys. Because
additional information is incorporated into the
encrypted data fields, for example start character
sequences, key bits and information relating to the
initialization vector used for the encryption
algorithm, it is necessary to compress the data fields
which are to be encrypted.

In the appended figure 2, the encryption and decryption
of data fields is illustrated. The individual steps are
explained in more detail below.

The description of the method depends on the following
conditions:

- Each character is represented by a byte (for
  example ASCII or EBCDIC code). Before the
  encryption or decryption, all the characters of a
  field are converted into an internal character set
  (ASCII) and afterwards converted back appropriately.

The different parameters are defined as follows:

1. a character set (for example 91 specific
   characters of the EBCDIC code);

2. a set of the start characters and stop
   characters for encrypted data fields, which
   are not included in the character set;

3. an alternative character for characters which
   do not belong to the character set (is part
   of the character set);

4. possibly necessary fill characters (are part
   of the character set);

5. method parameters for the compression
   process;

6. information on how the original data field is
   to be subsequently processed when
   compression is not successful;

7. information on the representation of bit
   sequences as sequences of permissible
   characters;

8. information on which of the keys from the key
   set is to be used.
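Parameters 1 to 3 and the conversion into the internal character set can be sketched as follows. The concrete values are assumptions for illustration: the 91-character set is taken here as the ASCII range 0x20 to 0x7A, "?" is chosen as the alternative character, and Python's `cp500` codec stands in for the EBCDIC code page.

```python
# Assumed 91-character set: ASCII 0x20 (space) up to and including 0x7A ("z").
CHARSET = "".join(chr(code) for code in range(0x20, 0x7B))
START, STOP = "{", "}"   # start/stop characters, deliberately outside CHARSET
ALTERNATIVE = "?"        # assumed alternative character, itself part of CHARSET


def to_internal(raw: bytes, codec: str = "cp500") -> str:
    """Decode an EBCDIC field and replace characters outside the character set."""
    text = raw.decode(codec)
    return "".join(ch if ch in CHARSET else ALTERNATIVE for ch in text)
```

With these assumptions, a start or stop character appearing inside field content is replaced by the alternative character, so the markers remain reserved for encrypted fields.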

Depending on the size (cardinality) of the character
set, individual bit segments can each be converted to
form character sequences of a specific length (for
example, given a character set of 91 characters, every
13 bits can be converted efficiently into two
characters). Ideally, the entire bit sequence would be
converted as a whole by considering
the sequence as a binary number and representing this
number in the base b = size of the character set.
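The 13-bit example works because 91 * 91 = 8281 is at least 2^13 = 8192, so every 13-bit value has a two-digit representation in base 91. A minimal sketch follows; the character set (ASCII 0x20 to 0x7A) and the function names are assumptions for illustration.

```python
# Assumed 91-character set: ASCII 0x20 (space) up to and including 0x7A ("z").
CHARSET = "".join(chr(code) for code in range(0x20, 0x7B))
BASE = len(CHARSET)  # 91


def bits_to_chars(value: int) -> str:
    """Encode one 13-bit value as two characters of the character set."""
    if not 0 <= value < 2 ** 13:
        raise ValueError("value does not fit into 13 bits")
    high, low = divmod(value, BASE)
    return CHARSET[high] + CHARSET[low]


def chars_to_bits(pair: str) -> int:
    """Decode two characters back into the 13-bit value."""
    return CHARSET.index(pair[0]) * BASE + CHARSET.index(pair[1])
```

The largest 13-bit value, 8191, maps to the digits (90, 1), which still lie within the 91-character set.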



A method for efficiently encoding as long a bit sequence
as possible into a data field of a predefined
length, intended for implementation
on systems with 32-bit processors, is described below.
First, for a given character set of size b, the
following is calculated once before the basic
initialization ("ln" here denotes the natural
logarithm):

• the minimum value of x/y is determined for
  integral y from 1 to 32 and integral x >
  y*ln(2)/ln(b).
  For example: when b = 91, a minimum is obtained
  when x = 2 and y = 13.

• for all values x' from 1 to x-1, the respective
  integral maximum y'(x') is calculated by means of
  y'(x')*ln(2)/ln(b) < x'. In addition, y'(0) = 0 is
  selected.
  Example: when b = 91 and x = 2, the following is
  obtained: y'(1) = 6.
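The two precomputations can be sketched as follows. The function names are assumptions for illustration, and the logarithmic conditions are rewritten as exact integer comparisons (x > y*ln(2)/ln(b) is equivalent to b^x > 2^y) to avoid floating-point rounding.

```python
from fractions import Fraction


def chunk_parameters(b: int, max_bits: int = 32):
    """Return (x, y) minimizing x/y, where x is the smallest integer with b**x > 2**y."""
    best = None
    for y in range(1, max_bits + 1):
        x = 1
        while b ** x <= 2 ** y:      # same condition as x > y*ln(2)/ln(b)
            x += 1
        ratio = Fraction(x, y)
        if best is None or ratio < best[0]:
            best = (ratio, x, y)     # on ties the smaller y is kept
    return best[1], best[2]


def remainder_bits(b: int, x: int):
    """Return [y'(0), ..., y'(x-1)]: the most bits that fit into x' characters."""
    table = [0]                      # y'(0) = 0 by definition
    for x_rem in range(1, x):
        y_rem = 0
        while 2 ** (y_rem + 1) < b ** x_rem:   # y'*ln(2)/ln(b) < x'
            y_rem += 1
        table.append(y_rem)
    return table
```

For b = 91 this reproduces the values given in the text: x = 2, y = 13 and y'(1) = 6.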

A bit sequence can then be converted into a data field
of the length d as follows:

1. In each case y bits are converted into
   x characters.
   Example: when b = 91, every 13 bits are replaced
   by 2 characters each.

2. If the given data field length d cannot be divided
   by x, y'(x') bits are converted into the remaining
   x' characters. In the example, 6 bits are
   represented by one further character.

Let s be the number of start characters
used in the encrypted data field, and let

L(d,b,s) = L = ((d-s-1) DIV x) * y + y'((d-s-1) MOD x)

be the number of bits which can be
converted into a data field of the length (d-s-1) by
applying the above method. The value (d-s-1) results
from the fact that the set of start characters of the
length s and the stop character must be included in the
encrypted data field.

When d = 30, b = 91 and s = 1, the following is
obtained, for example: L = 14 * 13 + 0 = 182;
when d = 15, b = 91 and s = 3, L = 5 * 13 + y'(1)
= 65 + 6 = 71.
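The capacity formula can be checked with a few lines of code. The function name `capacity_bits` is an assumption; the constants are the values stated in the text for b = 91 (x = 2, y = 13, y'(0) = 0, y'(1) = 6).

```python
X, Y = 2, 13          # chunk parameters for b = 91, as given in the text
Y_REM = [0, 6]        # y'(0) and y'(1) for b = 91


def capacity_bits(d: int, s: int, x: int = X, y: int = Y, y_rem=Y_REM) -> int:
    """Number of bits L that fit into a field of length d with s start characters."""
    usable = d - s - 1                 # characters left after start sequence and stop character
    return (usable // x) * y + y_rem[usable % x]
```

Here `capacity_bits(30, 1)` evaluates 28 usable characters as 14 full chunks, and `capacity_bits(15, 3)` evaluates 11 usable characters as 5 full chunks plus one remaining character.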

Let m = (L - k - length of the compressed bit sequence)
be the number of bits still available after the
compression, where k bits are provided for the number
of the key used. Any desired method can be used for the
compression. Depending on this number m, it is defined
how the initialization vector is made available for the
encryption and how it is encoded.

The suitable selection of the initialization vector
ensures that isonomies are resolved. In principle, the
following possibilities can be used for this:

• use of random numbers

• use of counters

Various keys of the key set composed of k keys can be
used with staggered timing. During the encryption it is
necessary to define which of these keys is to be used.
The key number is encoded by k bits.

If the bit sequence composed of the k bits for the
number of the key, the bits for the encoding of the
initialization vector and the bits for the compressed
data field is shorter than necessary, i.e.
shorter than L, it is filled in at the end with "0"
bits until the maximum admissible bit length L is
reached.
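The assembly and zero-padding of the bit sequence can be sketched as follows. Bits are represented as strings of "0" and "1" for readability; the function name and the ordering of the three parts (key number, initialization vector, compressed data) follow the order in which the text lists them and are otherwise an assumption.

```python
def assemble_bits(key_number: int, k: int, iv_bits: str, data_bits: str, capacity: int) -> str:
    """Concatenate key number, IV bits and data bits, then pad with "0" bits to capacity."""
    if k < 1 or not 0 <= key_number < 2 ** k:
        raise ValueError("key number does not fit into k bits")
    bits = format(key_number, f"0{k}b") + iv_bits + data_bits
    if len(bits) > capacity:
        raise ValueError("bit sequence exceeds the admissible length L")
    return bits.ljust(capacity, "0")   # fill at the end with "0" bits
```

For instance, key number 5 with k = 3, a 4-bit initialization vector and 3 data bits occupy 10 bits, so a capacity of 13 bits leads to three trailing "0" bits.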

The compressed data field content is encrypted.



The encryption can be carried out with a block
encryption algorithm and the stored secret key in the
CBC mode, the last block of the length j (if this is
shorter than 64 bits) being encrypted in the CFB mode
(see for example ISO/IEC 10116, Information Technology,
Modes of Operation for an n-bit Block Cipher
Algorithm, 1991).

In this consideration it is assumed that the typical
block length of 64 bits is used. It is clearly possible
to generalize to other block lengths. In another
variant, what are referred to as stream cipher
algorithms could be used directly for
character-by-character encryption.
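The combination of CBC for full blocks and CFB for a short final block keeps the ciphertext exactly as long as the plaintext. The sketch below illustrates only the mode handling. The block cipher is a toy 4-round Feistel construction built on SHA-256, used purely as a stand-in for IDEA or DES; it is an assumption of this example and must not be used for real protection.

```python
import hashlib

BLOCK = 8  # block length in bytes (64 bits)


def _xor(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, truncating to the shorter one."""
    return bytes(x ^ y for x, y in zip(a, b))


def _round(key: bytes, rnd: int, half: bytes) -> bytes:
    """Toy round function (stand-in only, not a real cipher)."""
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:4]


def encrypt_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:4], block[4:]
    for rnd in range(4):
        left, right = right, _xor(left, _round(key, rnd, right))
    return left + right


def decrypt_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:4], block[4:]
    for rnd in reversed(range(4)):
        left, right = _xor(right, _round(key, rnd, left)), left
    return left + right


def encrypt_field(key: bytes, iv: bytes, data: bytes) -> bytes:
    out, prev = [], iv
    full = len(data) - len(data) % BLOCK
    for i in range(0, full, BLOCK):          # CBC over the full blocks
        prev = encrypt_block(key, _xor(data[i:i + BLOCK], prev))
        out.append(prev)
    if data[full:]:                          # CFB for the short last block
        out.append(_xor(data[full:], encrypt_block(key, prev)))
    return b"".join(out)


def decrypt_field(key: bytes, iv: bytes, data: bytes) -> bytes:
    out, prev = [], iv
    full = len(data) - len(data) % BLOCK
    for i in range(0, full, BLOCK):
        block = data[i:i + BLOCK]
        out.append(_xor(decrypt_block(key, block), prev))
        prev = block
    if data[full:]:                          # CFB decryption also uses encrypt_block
        out.append(_xor(data[full:], encrypt_block(key, prev)))
    return b"".join(out)
```

Changing the initialization vector changes the first ciphertext block and, through chaining, all later ones, which is how different ciphertexts arise for identical field contents.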

Finally, in order to form the encrypted data field, the
character sequence which is obtained is inserted
between the start character sequence and the stop
character.

As soon as the start character sequence is detected in
the data stream, the subsequent characters are read
into an internal memory until the stop character
appears.

If the start character sequence is among the subsequent
characters, the process of storing is terminated and
restarted at the new start character sequence. If a stop
character has still not been detected after a
predefined maximum length, the process is also
terminated and the next start character sequence is
looked for again. If there are fewer than a predefined
lower limit of characters between the start character
sequence and the stop character, the storage is also
terminated.
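The detection and resynchronization rules above can be sketched as a small scanner. For brevity the example assumes a single start character rather than a start character sequence; the function name and the default limits are likewise assumptions.

```python
def extract_marked_fields(stream: str, start: str = "{", stop: str = "}",
                          min_len: int = 1, max_len: int = 40) -> list:
    """Collect the contents of all well-formed marked fields in a character stream."""
    fields = []
    buffer = None                      # None: currently outside a marked field
    for ch in stream:
        if ch == start:
            buffer = []                # a new start character always restarts storing
        elif buffer is not None:
            if ch == stop:
                if len(buffer) >= min_len:
                    fields.append("".join(buffer))
                buffer = None          # too-short fields are discarded
            else:
                buffer.append(ch)
                if len(buffer) > max_len:
                    buffer = None      # no stop character within max_len: give up
    return fields
```

In the stream `a{xy}b{c{de}f`, the second start character inside `{c{de}` discards the partial content `c`, so only `xy` and `de` are returned.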

Not every data field can be compressed to such an
extent that the desired number of bits is available for
the initialization vector. The shorter the data set
length, the worse the compression, with the consequence
that fewer bits are available for the initialization
vector and there are thus fewer possible ways of
generating various ciphertexts for a data field.

In such a case, there are in principle the following
three possible ways of continuing:

1. Shortening the data field until sufficient
   compression can be achieved. However, this is
   inevitably associated with loss of information.

2. The affected data field is not encrypted, and it
   will thus remain in plain text. This can possibly
   be acceptable if this occurs rarely in relation to
   the overall set of data fields to be encrypted.

3. Use of the pseudonymization approach, which is
   described below.

It may be found that no adequate compression of the
data records can be achieved when there is a predefined
fixed field length. If shortening or passing on in
plain text is not acceptable, the complete "masking" of
all the selected data records can be implemented by
means of the pseudonymization approach.

Data fields can be linked to pseudonyms, and vice
versa, in a way analogous to an alias. The information
is contained in a table.

Leutheusser-Schnarrenberger <-> X1BXE.....H

Garmisch-Partenkirchen <-> X2BXD9....Z
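A two-way pseudonym table of this kind can be sketched as follows. The class name `PseudonymTable`, the pseudonym alphabet and the random generation of pseudonyms are assumptions for illustration; the text only requires that the link can be followed in both directions.

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits  # assumed pseudonym alphabet


class PseudonymTable:
    """Bidirectional mapping between field values and fixed-length pseudonyms."""

    def __init__(self, length: int):
        self.length = length
        self._to_pseudonym = {}
        self._to_value = {}

    def pseudonymize(self, value: str) -> str:
        """Return the pseudonym for value, allocating a new unique one if needed."""
        if value not in self._to_pseudonym:
            while True:
                candidate = "".join(secrets.choice(ALPHABET) for _ in range(self.length))
                if candidate not in self._to_value:   # keep pseudonyms unique
                    break
            self._to_pseudonym[value] = candidate
            self._to_value[candidate] = value
        return self._to_pseudonym[value]

    def resolve(self, pseudonym: str) -> str:
        """Return the original value for a pseudonym."""
        return self._to_value[pseudonym]
```

Because the pseudonym length is chosen independently of the original value, a long entry such as "Leutheusser-Schnarrenberger" can be replaced by a pseudonym short enough to leave room for the start and stop characters.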

If the pseudonymization is necessary at a plurality of
spatially separated locations, the pseudonyms which are
allocated at any one location must be reserved at all
the other locations (replication). This means
additional communication costs. Additional measures for
protecting the transmission are necessary.

The encrypted data fields can be stored over relatively
long time periods, for example 5 to 15 years. The use
of different keys staggered over time is advisable for
the following reasons:

• If the key becomes known, the entire set of
  encrypted data fields must be considered as being
  exposed.

• The set of encrypted data fields which is
  available to a cryptanalyst is significantly
  smaller if a plurality of keys are used.

For this reason, the method provides k keys for each
set of database users which cooperate.

The keys can be generated in a trust center
(trustworthy third-party entity) which makes available
the necessary technical and organizational environment.

Different sets of database users which do not cooperate
with one another should have different sets of keys
which do not have any dependence on one another. This
excludes the possibility of one set of database users
being able to access database information from
another set of database users.






The key management is composed of the following
functions:

1. Generation of keys

   A key packet composed of k keys is generated. A
   hardware random number generator is particularly
   suitable for this. After the
   generation of the keys, the generated keys can be
   stored on a key storage medium, for example a
   smart card or PCMCIA card. These media can be
   configured in such a way that they carry out the
   cryptographic calculations themselves, or issue
   keys only after authentication has been performed.
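Generating a key packet can be sketched as follows. The function name and the default key length are assumptions for illustration; 16 bytes corresponds to the 128-bit key length of IDEA. Python's `secrets` module draws on the operating system's random source and serves here as a software stand-in for the hardware random number generator mentioned above.

```python
import secrets


def generate_key_packet(count: int, key_bytes: int = 16) -> list:
    """Generate a packet of independent random keys, addressed by key number."""
    if count < 1:
        raise ValueError("a key packet needs at least one key")
    return [secrets.token_bytes(key_bytes) for _ in range(count)]


packet = generate_key_packet(4)
key_number = 2                 # the number stored in the encrypted data field
selected_key = packet[key_number]
```

The key number, not the key itself, is what is recorded in each encrypted data field, so the packet must be kept for as long as fields encrypted under any of its keys exist.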



2. Distribution of keys

   From the location where the keys are generated,
   the keys can be transported on a key storage
   medium to the place of use (terminal) or to a
   secure place of storage (back-up).

3. Introducing keys into terminals

   A terminal is defined by the fact that it can
   carry out the necessary encryption and decryption
   processes. Such a device can be a specially
   developed piece of hardware or a PC. The keys can
   be loaded into a terminal from the key storage
   medium after prior authentication has been
   performed, or the terminal can send the key storage
   medium requests to perform encryption and
   decryption. The latter case
   requires corresponding resources on the key
   storage medium, but has the advantage that the
   keys never leave the key storage medium.



4. Destroying keys

   If a cooperating set of database users no longer
   requires a key package composed of k keys, it is
   possible to destroy the keys by means of suitable
   measures, for example by destroying the key
   storage medium and deleting the key package from
   the corresponding terminals, if present.

1. A method for anonymizing sensitive data within a
   data stream, having the following steps:

   a) the sensitive data field is compressed,

   b) the sensitive data field is anonymized,

   c) the anonymized sensitive data field is marked
      within the data stream by means of start and
      stop characters.

2. The method as claimed in claim 1, characterized in
   that the sensitive data field is filled up by fill
   characters before the anonymization.

3. The method as claimed in claim 1 or 2,
   characterized in that the data to be anonymized is
   pseudonymized.

4. The method as claimed in claim 1 or 2,
   characterized in that the data to be anonymized is
   encrypted.

5. The method as claimed in claim 4, characterized in
   that sensitive data fields are at least partially
   filled in with random values before the
   encryption.

6. The method as claimed in claim 4 or 5,
   characterized in that information relating to the
   key to be used for the encryption is stored in the
   encrypted data field.

7. The method as claimed in one of claims 1 to 6,
   characterized in that the sensitive data field has
   a fixed field length.